training set size
The Data Addition Dilemma
Shen, Judy Hanwen, Raji, Inioluwa Deborah, Chen, Irene Y.
In many machine learning for healthcare tasks, standard datasets are constructed by amassing data across many, often fundamentally dissimilar, sources. But when does adding more data help, and when does it hinder progress on desired model outcomes in real-world settings? We identify this situation as the \textit{Data Addition Dilemma}, demonstrating that adding training data in this multi-source scaling context can at times result in reduced overall accuracy, uncertain fairness outcomes, and reduced worst-subgroup performance. We find that this possibly arises from an empirically observed trade-off between model performance improvements due to data scaling and model deterioration from distribution shift. We thus establish baseline strategies for navigating this dilemma, introducing distribution shift heuristics to guide decision-making on which data sources to add in data scaling, in order to yield the expected model performance improvements. We conclude with a discussion of the required considerations for data collection and suggestions for studying data composition and scale in the age of increasingly larger models.
- North America > United States > South Dakota (0.04)
- North America > United States > Louisiana (0.04)
- North America > United States > Pennsylvania (0.04)
- (8 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
- Health & Medicine > Health Care Technology (0.67)
- Health & Medicine > Therapeutic Area > Neurology (0.46)
- Health & Medicine > Diagnostic Medicine > Imaging (0.45)
- Information Technology > Data Science > Data Mining (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.46)
- Europe > Italy > Lombardy > Milan (0.04)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- North America > United States > Michigan (0.04)
PyEPO: A PyTorch-based End-to-End Predict-then-Optimize Library for Linear and Integer Programming
In deterministic optimization, it is typically assumed that all problem parameters are fixed and known. In practice, however, some parameters may be a priori unknown but can be estimated from historical data. A typical predict-then-optimize approach separates predictions and optimization into two stages. Recently, end-to-end predict-then-optimize has become an attractive alternative. In this work, we present the PyEPO package, a PyTorchbased end-to-end predict-then-optimize library in Python. To the best of our knowledge, PyEPO (pronounced like pineapple with a silent "n") is the first such generic tool for linear and integer programming with predicted objective function coefficients. It provides four base algorithms: a convex surrogate loss function from the seminal work of Elmachtoub and Grigas [16], a differentiable black-box solver approach of Pogancic et al. [35], and two differentiable perturbation-based methods from Berthet et al. [6]. PyEPO provides a simple interface for the definition of new optimization problems, the implementation of state-of-the-art predict-then-optimize training algorithms, the use of custom neural network architectures, and the comparison of end-to-end approaches with the two-stage approach. PyEPO enables us to conduct a comprehensive set of experiments comparing a number of end-to-end and two-stage approaches along axes such as prediction accuracy, decision quality, and running time on problems such as Shortest Path, Multiple Knapsack, and the Traveling Salesperson Problem. We discuss some empirical insights from these experiments, which could guide future research. PyEPO and its documentation are available at https://github.com/khalil-research/PyEPO.
- North America > Canada > Ontario > Toronto (0.14)
- North America > United States > New Mexico > Los Alamos County > Los Alamos (0.04)
- Energy (0.46)
- Leisure & Entertainment > Games (0.46)
- Transportation (0.34)
Test Set Sizing Via Random Matrix Theory
This paper uses techniques from Random Matrix Theory to find the ideal training-testing data split for a simple linear regression with m data points, each an independent n-dimensional multivariate Gaussian. It defines "ideal" as satisfying the integrity metric, i.e. the empirical model error is the actual measurement noise, and thus fairly reflects the value or lack of same of the model. This paper is the first to solve for the training and test size for any model in a way that is truly optimal. The number of data points in the training set is the root of a quartic polynomial Theorem 1 derives which depends only on m and n; the covariance matrix of the multivariate Gaussian, the true model parameters, and the true measurement noise drop out of the calculations. The critical mathematical difficulties were realizing that the problems herein were discussed in the context of the Jacobi Ensemble, a probability distribution describing the eigenvalues of a known random matrix model, and evaluating a new integral in the style of Selberg and Aomoto. Mathematical results are supported with thorough computational evidence. This paper is a step towards automatic choices of training/test set sizes in machine learning.
- North America > United States > New York > New York County > New York City (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Africa > Middle East > Algeria (0.04)
Eigenspace Restructuring: a Principle of Space and Frequency in Neural Networks
Understanding the fundamental principles behind the massive success of neural networks is one of the most important open questions in deep learning. However, due to the highly complex nature of the problem, progress has been relatively slow. In this note, through the lens of infinite-width networks, a.k.a. neural kernels, we present one such principle resulting from hierarchical localities. It is well-known that the eigenstructure of infinite-width multilayer perceptrons (MLPs) depends solely on the concept frequency, which measures the order of interactions. We show that the topologies from deep convolutional networks (CNNs) restructure the associated eigenspaces into finer subspaces. In addition to frequency, the new structure also depends on the concept space, which measures the spatial distance among nonlinear interaction terms. The resulting fine-grained eigenstructure dramatically improves the network's learnability, empowering them to simultaneously model a much richer class of interactions, including Long-Range-Low-Frequency interactions, Short-Range-High-Frequency interactions, and various interpolations and extrapolations in-between. Additionally, model scaling can improve the resolutions of interpolations and extrapolations and, therefore, the network's learnability. Finally, we prove a sharp characterization of the generalization error for infinite-width CNNs of any depth in the high-dimensional setting. Two corollaries follow: (1) infinite-width deep CNNs can break the curse of dimensionality without losing their expressivity, and (2) scaling improves performance in both the finite and infinite data regimes.
- North America > United States (0.14)
- Europe > Netherlands > North Holland > Amsterdam (0.04)
A Theoretical-Empirical Approach to Estimating Sample Complexity of DNNs
Bisla, Devansh, Saridena, Apoorva Nandini, Choromanska, Anna
This paper focuses on understanding how the generalization error scales with the amount of the training data for deep neural networks (DNNs). Existing techniques in statistical learning require computation of capacity measures, such as VC dimension, to provably bound this error. It is however unclear how to extend these measures to DNNs and therefore the existing analyses are applicable to simple neural networks, which are not used in practice, e.g., linear or shallow ones or otherwise multi-layer perceptrons. Moreover, many theoretical error bounds are not empirically verifiable. We derive estimates of the generalization error that hold for deep networks and do not rely on unattainable capacity measures. The enabling technique in our approach hinges on two major assumptions: i) the network achieves zero training error, ii) the probability of making an error on a test point is proportional to the distance between this point and its nearest training point in the feature space and at a certain maximal distance (that we call radius) it saturates. Based on these assumptions we estimate the generalization error of DNNs. The obtained estimate scales as O(1/(\delta N^{1/d})), where N is the size of the training data and is parameterized by two quantities, the effective dimensionality of the data as perceived by the network (d) and the aforementioned radius (\delta), both of which we find empirically. We show that our estimates match with the experimentally obtained behavior of the error on multiple learning tasks using benchmark data-sets and realistic models. Estimating training data requirements is essential for deployment of safety critical applications such as autonomous driving etc. Furthermore, collecting and annotating training data requires a huge amount of financial, computational and human resources. Our empirical estimates will help to efficiently allocate resources.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > California > San Diego County > San Diego (0.04)
- Europe > France (0.04)
- (6 more...)
- Education (0.49)
- Transportation > Ground > Road (0.34)
- Information Technology > Robotics & Automation (0.34)
- Automobiles & Trucks (0.34)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Perceptrons (0.54)
A Linear Time Active Learning Algorithm for Link Classification
Cesa-bianchi, Nicolò, Gentile, Claudio, Vitale, Fabio, Zappella, Giovanni
We present very efficient active learning algorithms for link classification in signed networks. Our algorithms are motivated by a stochastic model in which edge labels are obtained through perturbations of a initial sign assignment consistent with a two-clustering of the nodes. We provide a theoretical analysis within this model, showing that we can achieve an optimal (to whithin a constant factor) number of mistakes on any graph $G = (V,E)$ such that $|E|$ is at least order of $|V|^{3/2}$ by querying at most order of $|V|^{3/2}$ edge labels. More generally, we show an algorithm that achieves optimality to within a factor of order $k$ by querying at most order of $|V| + (|V|/k)^{3/2}$ edge labels. The running time of this algorithm is at most of order $|E| + |V|\log|V|$.
- Europe > Italy > Lombardy > Milan (0.04)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- North America > United States > Michigan (0.04)